Systems and methods for extracting phrases from text

ABSTRACT

Systems and methods for extracting phrases from text are disclosed. In an exemplary embodiment, a method may include preprocessing desired phrases into at least one phrase indexing data structure for efficient matching. The method may also include scanning text to construct a hash table including keys and corresponding entries. The method may also include locating suffix trie trees for each word in the hash table. The method may also include matching each position in the hash table against the suffix trie trees, and outputting phrases matched in the scanned text.

BACKGROUND

Although there are a large number of websites on the Internet or World Wide Web (www), users often are only interested in information on specific web pages from some websites. Given the sheer size of the Internet, it has become increasingly important to tailor searches and product recommendations to the user's personal interests/preferences. It is burdensome for both the user and the entity seeking information to ask the user to specify his/her personal interests/preferences. In addition, those interests/preferences may change over time. It is therefore generally considered more convenient to automatically discover user interests/preferences from the web pages the user visits. To enable such a tool, information contained in web pages visited by the user needs to be identified and extracted.

This may include extracting movie titles, entity names (e.g., sports teams), or other information from web or text documents. On the surface, the problem of matching movie titles (or other information) appears to be a string matching problem. But actually, it has special characteristics and the standard string matching algorithms do not work well.

There are several well-known algorithms for string matching. The Knuth-Morris-Pratt and Boyer-Moore algorithms match a single string pattern against an input string. The Aho-Corasick and Set-wise Boyer-Moore algorithms match multiple patterns. When applied to human language text, these algorithms usually work at the character level. That is, the basic elements that form the patterns and the input text are considered to be characters. This means that the alphabet (the set of elements) is relatively small. In the case of English, there are twenty-six letters, plus a few other characters (e.g., space and apostrophe). Some of the algorithms take advantage of the small size of the English alphabet by pre-computing the actions for each element. For example, the Boyer-Moore algorithm pre-computes, for each element in the alphabet, the number of characters in the input text that can be skipped and stores these skip values in a table. However, these algorithms are limited to use with a specific language. In addition, a large number of comparisons are needed to effectively extract the desired information and therefore the algorithms are generally considered to be inefficient for these purposes.
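For purposes of illustration only, the skip-table pre-computation mentioned above may be sketched as follows. The sketch uses the Horspool simplification of the Boyer-Moore algorithm (a single bad-character table), not the full algorithm, and the function names build_skip_table and horspool_search are hypothetical names introduced here.

```python
def build_skip_table(pattern):
    """Horspool-style table: how far to shift for each character of the pattern."""
    m = len(pattern)
    # The last pattern character is excluded; for repeated characters the
    # rightmost occurrence wins because later entries overwrite earlier ones.
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    skip = build_skip_table(pattern)
    i = 0
    while i + m <= n:
        if text[i:i + m] == pattern:
            return i
        # Shift by the table value of the text character aligned with the
        # end of the pattern; characters absent from the pattern shift by m.
        i += skip.get(text[i + m - 1], m)
    return -1


print(build_skip_table("brown"))
print(horspool_search("the quick brown fox", "brown"))
```

The table has one entry per distinct character, which is practical only because a character-level alphabet is small; a word-level alphabet would make such a table far larger.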

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high-level diagram of an exemplary networked computer system in which systems and methods for extracting phrases from text may be implemented.

FIG. 2 is a flowchart illustrating exemplary operations for extracting phrases from text.

FIG. 3 is a flowchart illustrating exemplary matching operations in more detail.

FIG. 4 is a flowchart illustrating additional exemplary preprocessing operations.

DETAILED DESCRIPTION

Systems and methods for extracting multi-word patterns or “phrases” from a text (e.g., a web page or document) are disclosed. In exemplary embodiments, a database is constructed with the desired phrase, and in some embodiments, related phrases that often appear in the same text when the desired phrase is used. The systems and methods may then be implemented for matching phrases in the text.

Exemplary systems and methods are described herein with reference to the particular application of extracting movie titles from text. By knowing which movie the user is interested in, links can then be provided for the user to, e.g., rent the movie, read reviews, or discover similar movies. It is noted, however, that the systems and methods disclosed herein may also be implemented for a wide variety of other applications (e.g., for extracting from text song names, sports team names, companies, schools, government agencies, etc.).

The disclosed multi-word pattern matching approach is different from the standard string matching problem. Take the English language as an example. At the word level, it is well known that the frequency of words follows a Zipf distribution. Zipf's law states that, in a structured set of texts of natural language the frequency of any word is inversely proportional to its rank in frequency. Accordingly, the most frequent word occurs twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, and so forth.
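Expressed as a formula, and assuming the idealized form of Zipf's law with exponent 1, the relationship above may be written as follows, where f(r) denotes the frequency of the word of rank r and C is a corpus-dependent constant (both symbols are introduced here for illustration only):

```latex
f(r) = \frac{C}{r}, \qquad \frac{f(1)}{f(2)} = \frac{C/1}{C/2} = 2, \qquad \frac{f(2)}{f(4)} = \frac{C/2}{C/4} = 2
```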

The systems and methods disclosed herein take advantage of this distribution by comparing each unique word (not each occurrence of a word) against the least frequent word in the phrase (e.g., a movie title). This approach greatly reduces the number of comparisons needed to identify the phrase in the text. In addition, when working at the word level, the set of elements in the patterns and text are all the possible words in one or more languages. Accordingly, the systems and methods described herein can work with multiple languages at the same time.

In addition, text can be compared against a very large set of phrases. This large element set is handled by using hash tables to store the mapping from the words to data structures (st(w), pt(w) and hash table), making the comparison very efficient. This set-wise, multi-word matching problem also enables special characteristics (e.g., word frequency comparisons) that the string matching algorithms do not take advantage of.

FIG. 1 is a high-level illustration of an exemplary networked computer system 100 (e.g., via the Internet) in which systems and methods for extracting phrases from text may be implemented. The networked computer system 100 may include one or more communication networks 110, such as a local area network (LAN) and/or wide area network (WAN), for connecting one or more websites 120 at one or more hosts 130 (e.g., servers 130 a-c) to one or more users 140 (e.g., client computers 140 a-c).

The term “client” as used herein (e.g., client computers 140 a-c) refers to one or more computing devices through which one or more users 140 may access the network 110. Clients may include any of a wide variety of computing systems, such as a stand-alone personal desktop or laptop computer (PC), workstation, personal digital assistant (PDA), or appliance, to name only a few examples. Each of the client computing devices may include memory, storage, and a degree of data processing capability at least sufficient to manage a connection to the network 110, either directly or indirectly. Client computing devices may connect to network 110 via a communication connection, such as a dial-up, cable, or DSL connection via an Internet service provider (ISP).

The operations described herein may be implemented by the host 130 (e.g., servers 130 a-c which also host the website 120) or by a third party system 150 (e.g., servers 150 a-c) in the networked computer system 100. In either case, the servers may execute program code which enables comparison of phrases in one or more documents (e.g., a web page in website 120). The results may then be used to assist the user 140.

The term “server” as used herein (e.g., servers 130 a-c or servers 150 a-c) refers to one or more computing systems with computer-readable storage. The server may be provided on the network 110 via a communication connection, such as a dial-up, cable, or DSL connection via an Internet service provider (ISP). The server may be accessed directly via the network 110, or via a network site. In an exemplary embodiment, the website 120 may also include a web portal on a third-party venue (e.g., a commercial Internet site) which facilitates a connection for one or more servers via a back-end link or other direct link. The servers may also provide services to other computing or data processing systems or devices. For example, the servers may also provide transaction processing services for users 140.

When the server is “hosting” the website 120, it is referred to herein as the host 130 regardless of whether the server is from the cluster of servers 130 a-c or the cluster of servers 150 a-c. Likewise, when the server is executing program code for extracting phrases from text, it is referred to herein as the server 150 regardless of whether the server is from the cluster of servers 130 a-c or the cluster of servers 150 a-c.

The program code may execute the exemplary operations described herein for extracting phrases from text. In exemplary embodiments, the operations may be embodied as logic instructions on one or more computer-readable media. When executed on a processor, the logic instructions cause a general purpose computing device (e.g., server 150) to be programmed as a special-purpose machine that implements the described operations.

For purposes of illustration, the components and connections depicted in the figures may be used to extract movie titles from text (e.g., documents or web pages on the Internet 110), although the systems and methods are not limited to such an application. By knowing which movie the user is interested in, links can be provided for the user to conveniently rent the movie, to read reviews of the movie, discover similar movies, etc.

There are already fairly complete movie databases 155 available, such as the Internet Movie Database (IMDB). These databases typically contain the movie title, directors, and actors/actresses. Other information may also be provided by these databases 155. There are two major challenges in efficient and accurate extraction. First, there are a large number of movies. For example, the IMDB database contains more than 120,000 movies. It is very expensive to search for each movie title that may be contained in the text. Second, a movie title is often a meaningful word or phrase which needs to be distinguished from other words in the text. And when multiple movies have the same name, it needs to be determined which movie, if any at all, the title really corresponds to. Exemplary operations for enabling these features are described in more detail now with reference to FIGS. 2-4.

FIG. 2 is a flowchart illustrating exemplary operations 200 for extracting phrases from text. First, the movie titles are preprocessed (operations 210-250) into one or more phrase indexing data structures for efficient matching. Then those phrase indexing data structures are used to expedite the matching process (FIG. 3). In operation 210, a keyword is identified for each movie title. In exemplary embodiments, the keyword is the word that appears least frequently in a typical text.
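By way of example only, keyword identification in operation 210 may be sketched as follows. The sketch assumes that word frequencies in a "typical text" are estimated from a small reference corpus, and that titles are split on whitespace; the helper names word_frequencies and pick_keyword and the sample corpus are hypothetical.

```python
from collections import Counter


def word_frequencies(corpus_texts):
    """Count word occurrences across a reference corpus (hypothetical helper)."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(text.lower().split())
    return counts


def pick_keyword(title, word_freq):
    """Return the word of a non-empty title that is rarest in the corpus."""
    words = title.lower().split()
    # Words unseen in the corpus get frequency 0 and are treated as rarest.
    # On ties, min() returns the first such word in the title.
    return min(words, key=lambda w: word_freq.get(w, 0))


# Made-up reference corpus, for illustration only.
corpus = ["the lord of the manor", "the war of the roses", "the rings of the lord"]
freq = word_frequencies(corpus)
print(pick_keyword("The Lord of the Rings", freq))
```

Choosing the rarest word means that a title is only examined when its keyword actually occurs in the scanned text, which is uncommon for most titles.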

In operation 220, all of the phrases are grouped by keyword. For example, all the movie titles (t_1, t_2, . . . , t_k) with the same keyword w are grouped. In operation 230, each phrase is broken at the first encountered keyword. For example, each t_i is broken at the first encountered keyword w in t_i, obtaining a prefix p_i and a suffix s_i. That is, t_i = p_i • w • s_i, where • indicates concatenation. In operation 240, suffix trie trees st(w) and pt(w) are built. For example, a suffix trie tree st(w) is built at the word level, on s_i's for i=1 to k. Then the order of words in p_i is reversed, obtaining q_i, and a suffix trie tree pt(w) is built on q_i's for i=1 to k. In operation 250, the suffix trie trees st(w) and pt(w) are stored in the phrase indexing data structures and are indexed by keyword w. The phrase indexing data structures serve as an overall index of the document. In operation 260, matching may commence, as described in more detail with reference to FIG. 3.
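Operations 220-250 may be sketched, for illustration only, as follows. The sketch models each suffix trie tree as a plain word-level trie built from nested dictionaries, which is an assumption about the data structure rather than a required implementation. The names END, insert, build_index, and the word_freq table are hypothetical.

```python
END = None  # hypothetical marker key: titles whose word sequence ends at a node


def insert(trie, words, title):
    """Insert a word sequence into a nested-dict trie and record the title."""
    node = trie
    for w in words:
        node = node.setdefault(w, {})
    node.setdefault(END, set()).add(title)


def build_index(titles, word_freq):
    """Return a mapping: keyword w -> (st(w), pt(w)). Titles must be non-empty."""
    index = {}
    for title in titles:
        words = title.lower().split()
        # Operation 210: the keyword is the least frequent word of the title.
        w = min(words, key=lambda x: word_freq.get(x, 0))
        # Operation 230: break at the first encountered keyword, t = p . w . s.
        k = words.index(w)
        p, s = words[:k], words[k + 1:]
        # Operations 220 and 250: titles sharing keyword w share one entry.
        st, pt = index.setdefault(w, ({}, {}))
        # Operation 240: st(w) holds s; pt(w) holds p with word order reversed (q).
        insert(st, s, title)
        insert(pt, p[::-1], title)
    return index


# Made-up frequencies, for illustration only.
word_freq = {"the": 1000, "of": 600, "lord": 5, "rings": 3, "war": 8}
index = build_index(["The Lord of the Rings", "Lord of War"], word_freq)
print(sorted(index))
```

Reversing the prefix lets matching walk backward from the keyword position word by word, in the same way that it walks forward through the suffix.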

FIG. 3 is a flowchart illustrating exemplary matching operations 300 in more detail. In operation 310, text is scanned (e.g., sequentially). In operation 320, a hash table is constructed in memory. The hash table includes keys (e.g., the words contained in the text) and entries (e.g., the list of positions of the words in the text). The hash table is used in operations 330 and 340.
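As a minimal sketch of operations 310 and 320, the hash table may be built in a single sequential pass over the text. The tokenization rule (lowercased runs of word characters) and the function name build_position_table are assumptions made here for illustration.

```python
import re
from collections import defaultdict


def build_position_table(text):
    """Map each unique word to the list of word positions where it occurs."""
    table = defaultdict(list)
    # Operation 310: scan the text sequentially, one word at a time.
    for pos, word in enumerate(re.findall(r"\w+", text.lower())):
        # Operation 320: key = word, entry = list of its positions.
        table[word].append(pos)
    return table


table = build_position_table("The king and the queen")
print(dict(table))
```

Because the keys are unique words, each word is looked up in the phrase index once, regardless of how many times it occurs in the text.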

In operation 330, the suffix trie trees st(w) and pt(w) in the phrase indexing data structures built in preprocessing are located for each word w in the hash table. In operation 340, each position in the hash table is matched against the suffix trie tree st(w) (wherein the matched set is referred to as L_1) and against the suffix trie tree pt(w) (wherein the matched set is referred to as L_2). An entry of the hash table records all the positions of a word's appearance in the document. By checking each position, the algorithm basically goes through the whole document. The specific manner in which the document is indexed is not important and any suitable process may be implemented. In operation 350, phrases matched in the text are output. Continuing the above example, the intersection between L_1 and L_2 is the set of titles matched in the document.
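Operations 320-350 may be sketched, by way of example only, as follows. The index INDEX is a hand-written stand-in for the output of preprocessing for two titles, with each suffix trie tree modeled as a nested dictionary. Taking the intersection of L_1 and L_2 separately at each position is one possible reading of the matching step. All names (END, INDEX, walk, match_titles) are hypothetical.

```python
import re
from collections import defaultdict

END = None  # hypothetical marker key holding the titles that end at a trie node

# Stand-in for the preprocessed index: keyword w -> (st(w), pt(w)).
LOTR, WAR = "The Lord of the Rings", "Lord of War"
INDEX = {
    "rings": ({END: {LOTR}},
              {"the": {"of": {"lord": {"the": {END: {LOTR}}}}}}),
    "lord": ({"of": {"war": {END: {WAR}}}},
             {END: {WAR}}),
}


def walk(trie, words, start, step):
    """Follow the trie from words[start] in direction step; collect titles."""
    found = set(trie.get(END, ()))  # copy, so the index is never mutated
    node, i = trie, start
    while 0 <= i < len(words) and words[i] in node:
        node = node[words[i]]
        found |= node.get(END, set())
        i += step
    return found


def match_titles(text, index):
    words = re.findall(r"\w+", text.lower())
    table = defaultdict(list)            # operation 320: word -> positions
    for pos, word in enumerate(words):
        table[word].append(pos)
    matched = set()
    for w, positions in table.items():   # operation 330: locate st(w), pt(w)
        if w not in index:
            continue
        st, pt = index[w]
        for pos in positions:            # operation 340: match each position
            l1 = walk(st, words, pos + 1, +1)   # words after the keyword
            l2 = walk(pt, words, pos - 1, -1)   # words before it, reversed
            matched |= l1 & l2           # operation 350: titles in both sets
    return matched


print(match_titles("We watched The Lord of the Rings last night", INDEX))
```

In the sample sentence the word "lord" also occurs, but the words following it do not complete "Lord of War", so L_1 is empty at that position and no match is reported for that title.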

Other embodiments are also contemplated. For example, preprocessing may be modified to improve the accuracy of matches by reducing false positives. In an exemplary embodiment, director and actor/actress information contained in the movie databases may be utilized in the preprocessing as described in more detail with reference to FIG. 4.

FIG. 4 is a flowchart illustrating additional exemplary preprocessing operations. The basic idea is that when searching for a movie title, director/actor/actress names associated with the movie may also be used. By finding more relevant names, there is a higher confidence that the title is a true match.

In operation 410, a list of related phrases is obtained from the database (e.g., associated people such as director/actor/actress for each movie). In operation 420, the relevance of the related phrases to the desired phrases is determined. For example, if the related phrases include persons, the relevance of each person is identified for each movie. In many cases, the relevance is already indicated in the movie database. When it is not, heuristics may be used to determine the relevance (e.g., the number of movies each person is associated with).

In operation 430, top related phrases are selected for each movie according to the relevance (e.g., x directors, y actors, and z actresses, where x, y, and z are some predefined constants such as x=1, and y=z=2). Either the top related phrases, or those meeting a predetermined threshold are selected. In an exemplary embodiment, the last name is indexed by the movie title. When a title is matched in the text, the list of related phrases (e.g., names) is retrieved and searched in the text. The confidence is then given by the number of times each related phrase (e.g., names) appears in the document. For example, the billing position may be used to determine the top related phrase wherein the billing position correlates to the importance of each role in the IMDB database. If billing position is not available, a weighted sum of the number of movies and TV events each actor/actress appears in may be used. Still other methods may also be implemented. In operation 440, matched results may be filtered using the top related phrases, e.g., to reduce false positives. Other factors may also be used to help decide whether a particular phrase refers to a movie title. For example, in a web document the number of overlapping words between a movie title and the HTML title may be used. Or for example, the number of times a movie title appears in the text may be used. Each factor may be assigned a weight and combined using a heuristic formula.
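One possible way of combining the factors described above is sketched below, for illustration only. The text does not specify the heuristic formula, so the weighted sum and its weights, as well as the function names count_occurrences and score_match and the sample names and text, are assumptions.

```python
import re


def count_occurrences(phrase, text):
    """Count whole-word, case-insensitive occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def score_match(title, related_names, text, html_title="",
                weights=(1.0, 1.0, 0.5)):
    """Combine three factors into one confidence score (hypothetical weights)."""
    w_names, w_overlap, w_title = weights
    # Factor 1: number of times the related phrases (e.g., last names) appear.
    name_hits = sum(count_occurrences(n, text) for n in related_names)
    # Factor 2: number of overlapping words between the title and the HTML title.
    overlap = len(set(title.lower().split()) & set(html_title.lower().split()))
    # Factor 3: number of times the title itself appears in the text.
    title_hits = count_occurrences(title, text)
    return w_names * name_hits + w_overlap * overlap + w_title * title_hits


page = "Heat is a 1995 film by Mann starring Pacino. Pacino plays a detective."
names = ["Mann", "Pacino", "De Niro"]
print(score_match("Heat", names, page, html_title="Heat (1995) review"))
print(score_match("Heat", names, "The heat wave continued all week."))
```

A matched title whose score falls below a chosen threshold could then be filtered out in operation 440 as a likely false positive.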

It is understood that the embodiments shown and described herein are intended only for purposes of illustration of exemplary systems and methods and are not intended to be limiting. In addition, the operations and examples shown and described herein are provided to illustrate exemplary implementations for extracting phrases from text. It is noted that the operations are not limited to those shown. Other operations may also be implemented. Still other embodiments for extracting phrases from text are also contemplated, as will be readily appreciated by those having ordinary skill in the art after becoming familiar with the teachings herein.

In addition to the specific embodiments explicitly set forth herein, other aspects and implementations will be apparent to those skilled in the art from consideration of the specification disclosed herein.

The invention claimed is:
 1. A method for extracting phrases from text, comprising, preprocessing desired phrases into at least one phrase indexing data structure for efficient matching; during preprocessing building suffix trie trees, wherein one of the suffix trie trees is built at a word level, and then an order of words is reversed to build another one of the suffix trie trees; after preprocessing, scanning text to construct a hash table including keys and corresponding entries; locating suffix trie trees in the at least one phrase indexing data structure for each word in the hash table; matching each position in the hash table against the suffix trie trees; and outputting phrases matched in the scanned text.
 2. The method of claim 1 wherein preprocessing further includes identifying a keyword for each of the desired phrases.
 3. The method of claim 2 wherein the keyword is a word appearing least frequent in a typical text.
 4. The method of claim 2 wherein preprocessing further includes grouping the desired phrases by keyword.
 5. The method of claim 2 wherein preprocessing further includes breaking each phrase at a first encountered keyword.
 6. The method of claim 1 wherein preprocessing further includes building the suffix trie trees and storing the suffix trie trees in the at least one phrase indexing data structure.
 7. The method of claim 1 wherein preprocessing further comprises reducing false positives using related phrases.
 8. The method of claim 7 further comprising determining relevance of the related phrases to the desired phrases.
 9. The method of claim 1 further comprising selecting top related phrases for each desired phrase based on relevance.
 10. The method of claim 1 further comprising during preprocessing: grouping all the desired phrases by keyword; breaking each phrase at a first encountered keyword obtaining p_i and s_i; building suffix trie trees st(w) and pt(w), wherein the suffix trie tree st(w) is built at a word level and then order of words in p_i is reversed to obtain q_i, wherein the suffix trie tree pt(w) is built on q_i; indexing the suffix trie trees st(w) and pt(w) by keyword.
 11. The method of claim 1 wherein the at least one phrase indexing data structure is an overall index of a document.
 12. A system for extracting phrases from text, comprising: at least one phrase indexing data structure residing in non-transitory computer readable media, the at least one phrase indexing data structure including desired phrases; program code stored on non-transitory computer readable media and executable by a processor for improving accuracy of matches in scanned text by reducing false positives using phrases related to the scanned text, the preprocessing program code building suffix trie trees, wherein one of the suffix trie trees is built at a word level, and then an order of words is reversed to build another one of the suffix trie trees; a hash table constructed in non-transitory computer readable media, the hash table including keys and corresponding entries; and program code stored on non-transitory computer readable media and executable by a processor for locating suffix trie trees for each word in the hash table and matching each position in the hash table against the suffix trie trees to match phrases in the scanned text.
 13. The system of claim 12 wherein the keyword for each of the desired phrases is identified.
 14. The system of claim 12 wherein the keyword is a word appearing least frequent in a typical text.
 15. The system of claim 14 wherein the desired phrases are grouped by keyword.
 16. The system of claim 14 wherein each phrase is broken at a first encountered keyword.
 17. The system of claim 12 wherein the hash table includes keys and corresponding entries, the keys being words contained in the text and the corresponding entries being the list of positions of the words in the text.
 18. The system of claim 12 wherein the program code executes a preprocessing operation to improve accuracy of matches by reducing false positives using related phrases.
 19. The system of claim 12 wherein phrases extracted from text are movie titles, book titles, album names, or sport team names.
 20. The system of claim 12 wherein a desired phrase is extracted from a large database.
 21. A system for extracting movie titles from web-based text, comprising: memory means for storing at least one phrase indexing data structure including desired movie titles; preprocessing means for improving accuracy of matches of movie titles in the web-based text by reducing false positives using phrases related to the movie titles, the preprocessing means building suffix trie trees, wherein one of the suffix trie trees is built at a word level, and then an order of words is reversed to build another one of the suffix trie trees; table means for storing keys and corresponding entries, the keys representing words in the web-based text, and the entries representing a list of positions of the words in the web-based text; and means for locating suffix trie trees for each word in the table means and matching each position in the table means against the suffix trie trees to match movie titles in a scanned web-based text.